This week’s tidytuesday data comes from Pro Football Reference and includes attendance, standings, and game stats for each game. Well do a quick EDA and generate a few ideas of what might be interesting to look at.
# Import the data from tidytuesday: https://github.com/rfordatascience/tidytuesday
attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv')
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/games.csv')First, lets take a look at the data and see what it can tell us at face value.
| team | team_name | year | total | home | away | week | weekly_attendance |
|---|---|---|---|---|---|---|---|
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 1 | 77434 |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 2 | 66009 |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 3 | NA |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 4 | 71801 |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 5 | 66985 |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 6 | 44296 |
| Name | Piped data |
| Number of rows | 10846 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| team | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
| team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2009.53 | 5.75 | 2000 | 2005.0 | 2010 | 2015.00 | 2019 | ▇▇▇▇▇ |
| total | 0 | 1.00 | 1080910.03 | 72876.97 | 760644 | 1040509.0 | 1081090 | 1123230.00 | 1322087 | ▁▁▇▆▁ |
| home | 0 | 1.00 | 540455.01 | 66774.65 | 202687 | 504360.0 | 543185 | 578342.00 | 741775 | ▁▁▅▇▁ |
| away | 0 | 1.00 | 540455.01 | 25509.33 | 450295 | 524974.0 | 541757 | 557741.00 | 601655 | ▁▂▇▇▂ |
| week | 0 | 1.00 | 9.00 | 4.90 | 1 | 5.0 | 9 | 13.00 | 17 | ▇▆▆▆▇ |
| weekly_attendance | 638 | 0.94 | 67556.88 | 9022.02 | 23127 | 63245.5 | 68334 | 72544.75 | 105121 | ▁▁▇▃▁ |
We certainly have some interesting statistics here to mess with. It looks like the ‘total,’ ‘home,’ and ‘away’ attendance columns represent the entire season, while the ‘weekly_attendance’ field is for each week. Therefore, it looks like each row is a game. So there should be some variability in how many rows there are amongst the teams, as some make it to the playoff and some do not (therefore they’d have more games). A look at the summary statistics should tell us more… if that is the case or not.
So, as was expected, there does seem to be some variability associated with how many games they play. Not to worry, I’ll probably stay away from analyzing just that, and focus on either more aggregated descriptives or simply other statistics herein.
| team | team_name | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | playoffs | sb_winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Miami | Dolphins | 2000 | 11 | 5 | 323 | 226 | 97 | 6.1 | 1.0 | 7.1 | 0.0 | 7.1 | Playoffs | No Superbowl |
| Indianapolis | Colts | 2000 | 10 | 6 | 429 | 326 | 103 | 6.4 | 1.5 | 7.9 | 7.1 | 0.8 | Playoffs | No Superbowl |
| New York | Jets | 2000 | 9 | 7 | 321 | 321 | 0 | 0.0 | 3.5 | 3.5 | 1.4 | 2.2 | No Playoffs | No Superbowl |
| Buffalo | Bills | 2000 | 8 | 8 | 315 | 350 | -35 | -2.2 | 2.2 | 0.0 | 0.5 | -0.5 | No Playoffs | No Superbowl |
| New England | Patriots | 2000 | 5 | 11 | 276 | 338 | -62 | -3.9 | 1.4 | -2.5 | -2.7 | 0.2 | No Playoffs | No Superbowl |
| Tennessee | Titans | 2000 | 13 | 3 | 346 | 191 | 155 | 9.7 | -1.3 | 8.3 | 1.5 | 6.8 | Playoffs | No Superbowl |
| Name | Piped data |
| Number of rows | 638 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| team | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
| team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
| playoffs | 0 | 1 | 8 | 11 | 0 | 2 | 0 |
| sb_winner | 0 | 1 | 12 | 13 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2009.53 | 5.76 | 2000.0 | 2005.00 | 2010.0 | 2014.75 | 2019.0 | ▇▇▇▇▇ |
| wins | 0 | 1 | 7.98 | 3.08 | 0.0 | 6.00 | 8.0 | 10.00 | 16.0 | ▂▆▇▆▂ |
| loss | 0 | 1 | 7.98 | 3.08 | 0.0 | 6.00 | 8.0 | 10.00 | 16.0 | ▂▆▇▆▂ |
| points_for | 0 | 1 | 350.28 | 71.40 | 161.0 | 299.00 | 348.0 | 396.00 | 606.0 | ▂▇▇▂▁ |
| points_against | 0 | 1 | 350.28 | 59.55 | 165.0 | 310.00 | 347.0 | 391.50 | 517.0 | ▁▃▇▆▁ |
| points_differential | 0 | 1 | 0.00 | 101.09 | -261.0 | -75.00 | 1.5 | 72.75 | 315.0 | ▂▆▇▅▁ |
| margin_of_victory | 0 | 1 | 0.00 | 6.32 | -16.3 | -4.70 | 0.1 | 4.57 | 19.7 | ▂▆▇▅▁ |
| strength_of_schedule | 0 | 1 | 0.00 | 1.63 | -4.6 | -1.10 | 0.0 | 1.20 | 4.3 | ▁▅▇▅▁ |
| simple_rating | 0 | 1 | 0.00 | 6.20 | -17.4 | -4.47 | 0.0 | 4.50 | 20.1 | ▁▆▇▅▁ |
| offensive_ranking | 0 | 1 | 0.00 | 4.34 | -11.7 | -3.18 | 0.0 | 2.70 | 15.9 | ▁▇▇▂▁ |
| defensive_ranking | 0 | 1 | 0.00 | 3.57 | -9.8 | -2.40 | 0.1 | 2.50 | 9.8 | ▁▅▇▅▁ |
So, look at this snapshot of the data, there seems to be all kinds of interesting statistics between attendance and what appears to be by team year statistics.
| year | week | home_team | away_team | winner | tie | day | date | time | pts_win | pts_loss | yds_win | turnovers_win | yds_loss | turnovers_loss | home_team_name | home_team_city | away_team_name | away_team_city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | 1 | Minnesota Vikings | Chicago Bears | Minnesota Vikings | NA | Sun | September 3 | 13:00:00 | 30 | 27 | 374 | 1 | 425 | 1 | Vikings | Minnesota | Bears | Chicago |
| 2000 | 1 | Kansas City Chiefs | Indianapolis Colts | Indianapolis Colts | NA | Sun | September 3 | 13:00:00 | 27 | 14 | 386 | 2 | 280 | 1 | Chiefs | Kansas City | Colts | Indianapolis |
| 2000 | 1 | Washington Redskins | Carolina Panthers | Washington Redskins | NA | Sun | September 3 | 13:01:00 | 20 | 17 | 396 | 0 | 236 | 1 | Redskins | Washington | Panthers | Carolina |
| 2000 | 1 | Atlanta Falcons | San Francisco 49ers | Atlanta Falcons | NA | Sun | September 3 | 13:02:00 | 36 | 28 | 359 | 1 | 339 | 1 | Falcons | Atlanta | 49ers | San Francisco |
| 2000 | 1 | Pittsburgh Steelers | Baltimore Ravens | Baltimore Ravens | NA | Sun | September 3 | 13:02:00 | 16 | 0 | 336 | 0 | 223 | 1 | Steelers | Pittsburgh | Ravens | Baltimore |
| 2000 | 1 | Cleveland Browns | Jacksonville Jaguars | Jacksonville Jaguars | NA | Sun | September 3 | 13:02:00 | 27 | 7 | 398 | 0 | 249 | 1 | Browns | Cleveland | Jaguars | Jacksonville |
From this vantage, it looks like I could connect the team name from the attendance data to the home team of the games data. To do so, I’ll have to make a key that matches. Most readily, it looks like I’ll need to combine a few the ‘team’ and ‘team_name’ columns in the attendance data.
att_reshape <- attendance %>%
mutate(t_name = str_c(team, team_name, sep = " ")) %>%
select(-team, -team_name)
# to merge these datasets, I needed 'week' to be numeric on both...
# and for whatever reason it was a character field to begin with...
games <- games %>%
mutate(week = as.numeric(week))
att_games <- left_join(att_reshape, games,
by = c('t_name' = 'home_team', 'year' = 'year', 'week' = 'week'))
kable(head(att_games))| year | total | home | away | week | weekly_attendance | t_name | away_team | winner | tie | day | date | time | pts_win | pts_loss | yds_win | turnovers_win | yds_loss | turnovers_loss | home_team_name | home_team_city | away_team_name | away_team_city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | 893926 | 387475 | 506451 | 1 | 77434 | Arizona Cardinals | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2000 | 893926 | 387475 | 506451 | 2 | 66009 | Arizona Cardinals | Dallas Cowboys | Arizona Cardinals | NA | Sun | September 10 | 20:35:00 | 32 | 31 | 322 | 1 | 330 | 2 | Cardinals | Arizona | Cowboys | Dallas |
| 2000 | 893926 | 387475 | 506451 | 3 | NA | Arizona Cardinals | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2000 | 893926 | 387475 | 506451 | 4 | 71801 | Arizona Cardinals | Green Bay Packers | Green Bay Packers | NA | Sun | September 24 | 16:06:00 | 29 | 3 | 455 | 1 | 209 | 4 | Cardinals | Arizona | Packers | Green Bay |
| 2000 | 893926 | 387475 | 506451 | 5 | 66985 | Arizona Cardinals | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2000 | 893926 | 387475 | 506451 | 6 | 44296 | Arizona Cardinals | Cleveland Browns | Arizona Cardinals | NA | Sun | October 8 | 16:15:00 | 29 | 21 | 315 | 2 | 240 | 0 | Cardinals | Arizona | Browns | Cleveland |
At first glance it looks like there is a bunch of missing data, but because I merged on the home team, almost all the missingness appears to be related to the team is away or if they had a by week. Which makes it less alarming… I was alarmed at first… “did my join work?”
Nevertheless, I think we’re almost at a point to start visualizing stuff. It would be helpful to have a column identifying if a team is home or away… just so I can sort on it later.
att_games_ha <- att_games %>%
mutate(away_team_ind = if_else(is.na(away_team) & !is.na(weekly_attendance), 1, 0),
away_team_ind = case_when(
is.na(weekly_attendance) ~ 9999,
TRUE ~ away_team_ind))So, by doing all of this, I can see trends across time of how well a team does in terms of attendance between home and away games.
Lets take a look.
chiefs <- att_games_ha %>%
filter(t_name == "Kansas City Chiefs")
chiefs_gg <- chiefs %>%
filter(away_team_ind != 9999) %>%
group_by(year, week, away_team_ind) %>%
summarize(average_att = mean(weekly_attendance)) %>%
ggplot(aes(x = year, y = average_att)) +
geom_point() +
facet_wrap(~away_team_ind)
chiefs_ggWell, that’s interesting… but I wonder if the obvious dip in at home attendance between ~2007 and ~2013 is simply a localized artifact to the Chiefs.
all_gg <- att_games_ha %>%
filter(away_team_ind != 9999) %>%
group_by(t_name, year, away_team_ind) %>%
summarize(average_att = mean(weekly_attendance)) %>% ungroup() %>%
ggplot(aes(x = year, y = average_att, color = as.factor(away_team_ind))) +
geom_point() +
facet_wrap(~t_name) +
theme(legend.position = "top")
all_gggg1 <- att_games_ha %>%
filter(away_team_ind != 9999) %>%
group_by(t_name, year, week, away_team_ind) %>%
summarize(average_att = mean(weekly_attendance)) %>%
ggplot(aes(t_name, average_att, color = as.factor(away_team_ind))) +
geom_boxplot(outlier.alpha = .5) +
coord_flip() +
theme(legend.position = "top")
#facet_wrap(c("away_team_ind"))
gg1